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Abstract — In this paper, we first propose a novel common 
subexpression elimination (CSE) algorithm for matrix-vector mul- 
tiplications over characteristic-2 fields. As opposed to previously 
proposed CSE algorithms, which usually focus on complexity 
savings due to recurrences of subexpressions, our CSE algorithm 
achieves two types of complexity reductions, differential savings 
and recurrence savings, by taking advantage of the cancelation 
property of characteristic-2 fields. Using our CSE algorithm, 
we reduce the additive complexities of cyclotomic fast Fourier 
transforms (CFFTs). Using a weighted sum of the numbers of 
multiplications and additions as a metric, our CFFTs achieve 
smaller total complexities than previously proposed CFFTs and 
other FFTs, requiring both fewer multiplications and fewer ad- 
ditions in many cases. 

Index Terms — Common subexpression elimination (CSE), Com- 
plexity theory. Convolution, Discrete Fourier transforms (DFTs), 
Galois fields. Multiple constant multiplication (MCM), Reed- 
Solomon codes. 



I. Introduction 

Discrete Fourier transforms (DFTs) over finite fields have 
widespread applications in error correction coding [1]. For 
Reed-Solomon codes, all syndrome-based bounded distance 
decoding methods involve DFTs over finite fields [1]: syn- 
drome computation and the Chien search are both evaluations 
of polynomials and hence can be viewed as DFTs; inverse 
DFTs are used to recover transmitted codewords in transform- 
domain decoders. Thus efficient DFT algorithms can be used 
to reduce the complexity of Reed-Solomon decoders. For ex- 
ample, using the prime-factor fast Fourier transform (FFT) 
in [2], Truong et al. proposed [3] an inverse-free transform- 
domain Reed-Solomon decoder with substantially lower com- 
plexity than time-domain decoders; FFT techniques are used 
to compute syndromes for time-domain decoders in [4]. 

Using an approach similar to those in previous works (see, 
for example, [5]), cyclotomic FFT (CFFT) was recently pro- 
posed [6] and two variants were subsequently considered [7], 
[8]. To avoid confusion, we refer to the CFFT proposed in [6] 
as direct CFFT (DCFFT) and those in [7] and [8] as inverse 
CFFT (ICFFT) and symmetric CFFT (SCFFT) respectively 
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henceforth in this paper DCFFT has been shown to be efficient 
for full DFTs of lengths up to 51 1 [6], and ICFFT and SCFFT 
are particularly suitable for partial DFTs, which compute only 
part of the spectral components and are important for such op- 
erations as syndrome computation of Reed-Solomon decoders 
[7], [8]. 

Although CFFTs in [6]-[8] achieve low multiplicative com- 
plexities, their additive complexities (numbers of additions 
required) are very high if implemented directly. The meth- 
ods used in [6]-[8] somewhat alleviate the problem, but the 
additive complexities of CFFTs in [6]-[8] remain quite high. 
In this paper, we first propose a novel common subexpression 
elimination (CSE) algorithm, and then use it to reduce the 
additive complexities of various CFFTs. The contributions of 
this paper are: 

• To minimize the additive complexities of CFFTs is a spe- 
cial case of the well-known collection-of-sums problem, 
which is NP-complete [9], [10]. Aiming to reduce ad- 
ditive complexities, previously proposed CSE algorithms 
focus primarily on identifying recurring subsets of sum- 
mands (we refer to this as subexpressions or patterns). 
In contrast, our CSE algorithm, which has only poly- 
nomial complexity, also takes advantage of two other 
types of complexity reductions enabled by the underlying 
characteristic-2 fields: in addition to explicit recurring 
subexpressions mentioned above, our CSE algorithm also 
considers implicit subexpressions for additional savings; 
since the difference between two sums may require fewer 
additions than one of the two sums, our CSE algorithm 
also captures savings of this type. 

• We investigate the properties of the three types of CFFTs 
mentioned above and establish the relations among them. 
We first show that the three types of CFFTs have the 
same multiplicative complexities assuming the same bi- 
linear forms. Furthermore, we establish that, under direct 
implementation, all three types of CFFTs have the same 
additive complexities. Finally, we show that there is a 
mapping between SCFFTs and ICFFTs that preserves 
the additive complexities regardless of implementation. 
Thus, from the perspective of both multiplicative and ad- 
ditive complexities, SCFFTs and ICFFTs are equivalent. 
Our results simplify the analysis of their multiplicative 
and additive complexities as well as performance com- 
parison. 

• Using our CSE algorithm, we reduce the additive com- 
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plexities of full CFFTs greatly. In comparison to the full 
CFFTs in [6]-[8], the best results to our knowledge, our 
CFFTs have 4%-15% smaller additive complexities while 
maintaining the same multiplicative complexities. Com- 
pared to some previously proposed FFTs techniques, our 
CFFTs require fewer multiplications and fewer additions. 
In comparison to some other FFTs techniques, our CFFTs 
require fewer multiplications but more additions; in such 
cases, the total complexities, obtained by assuming that 
a multiplication over GF(2"') is as complex as 2m — 1 
additions, of our CFFTs are smaller. 
The rest of the paper is organized as follows. In Section HIl 
we briefly review various CFFTs and CSE algorithms to make 
this paper self-contained. Section |III] presents our CSE algo- 
rithm. We investigate the properties of and relations among 
the three types of CFFTs in Section |lVl CFFTs with reduced 
additive complexities are obtained by using our CSE algorithm 
and presented in Section |V] 

II. Background 

A. Cyclotomic FFTs 

Given a primitive element a £ GF(2'™), the DFT of a vector 
/ = (/o, /i, . . . , /„_i)^ is defined as F ^ (/(««), /(a^), . . . , 
/K-i))^, where f{x) ^ T.Zo ^ GF(2™)[x0 A 

new cyclotomic FFTs algorithm was proposed in [6], and 
for short lengths (up to 511 [6]) it is computationally effi- 
cient. Representing f{x) as a sum of linearized polynomials 
by cyclotomic decomposition [6], [8], cyclotomic FFT F = 
ALf' = ALIlf, where A is an n x n binary matrix, L ~ 
diag(io, Li, . . . , Li-i) is a block diagonal matrix with square 
matrices Li's on the diagonal, I is the number of cyclotomic 
cosets, /' = (/(T, /f", • • • , is a permutation of the 
input vector /, and IT is a permutation matrix. Suppose Li 
corresponds to a coset of size m^, using a normal basis of 
GF(2™') generated by 7^, then Li becomes a circulant matrix 
[11]: 



L, = 



(1) 



Henceforth in this paper we assume Li's in L are always 
constructed by normal bases and we say Li in ([U is a circulant 
matrix generated by 7^. Thus the product of Li and can 
be computed as a cyclic convolution, for which fast bilinear 
form algorithms are available [12]-[15]. These fast algorithms 
can be written in matrix form as — Qi{Ribi ■ Pif'i) — 



Qti^r-Prft), where b, = (7i:,7j , ■ 



, Qi, R^, and 



Pi are binary matrices, Ci — Ribi is a precomputed constant 
vector, and • stands for pointwise multiplications. Combining 
all the terms, a DCFFT is given by F = AQ{c Pf'), where 
Q and P are both block matrices, for which the blocks off 
the diagonal are the zero matrices and the diagonal blocks are 
Qi's and Pi's respectively, and c = {cq , cf , . . . , cf_^)'^. We 

' In this paper, vectors and matrices are represented by boldface letters, and 
scalars by normal letters. 



remark that both Q and P are binary and usually sparse. For 
details of CFFTs, please refer to [6]. 

Two variants of CFFTs were proposed in [7], [8]. First, 
by using the same permutation for both F and /, SCFFTs 
proposed in [8] satisfy F' = L'^ A!^ f , where F' = UF 
and /' = n/. SCFFTs are so named because they have 
symmetric transform matrices, that is, L'^A''^ — A'L. It is 
easy to deduce that A' = UA. ICFFTs, proposed in [7], are 
based on inverse DFTs and satisfy F" = L^ A^ f, where 
F" is also a permutation of F. Both SCFFTs and ICFFTs 
require fewer multiplications than DCFFTs for partial DFTs, 
where only a subset of components in F are needed. 

The multiplicative complexity of each CFFT, i.e., the num- 
ber of multiplications required, is the total number of non- 
trivial scalar multiplications in all cyclic convolutions. That 
is, the multiplicative complexity of c, • Pif'i is the number 
of non-one elements in c, (no element is zero in c^), which 
is determined by the cyclic convolution algorithms. To find 
the optimum cyclic convolution algorithms with the minimum 
multiplicative complexities in CFFTs is still an open problem. 
In this paper, we use the cyclic convolution algorithms in [16]. 

The additive complexity of each CFFT is determined by the 
two matrix-vector multiplications in which both matrices are 
binary. For example, in DCFFTs, the matrices are AQ and 
P. Due to the large size of AQ, direct computation of the 
matrix-vector product will result in high additive complexity. 
A heuristic algorithm based on erasure decoding [17] was used 
in [6] to reduce the additive complexity. Similar optimization 
was also used in [7]. Another fast matrix-vector multiplication 
algorithm is the Four Russians' algorithm [18], but it is based 
on preprocessing and fails to efficiently exploit the matrix 
structure. CSE is another commonly used technique for fast 
matrix-vector multiphcation. 

B. Common Subexpression Elimination 

Consider a linear transform Y = MX, where Y and X 
are n- and n'-dimensional column vectors, respectively, and 
M is an 77, X n' matrix containing only 1,-1, and 0. Clearly, 
such a transform requires only additions and subtractions. It 
was shown that it is an NP-complete problem [9, Ensemble 
Computation], [10, Collection of Sums] to minimize the num- 
ber of additions and subtractions. 

A special type of the collection-of-sums problem is the 
MCM problem [19], where the relative position of a bit pattern 
within the matrix is of no importance [20]. This is a valid 
assumption in the case of the X ~ {c^xq, c^a;oi • • ■ , c"~^a;o)"^ 
with c = 2 or c = 2^^, which is common in filters. Thus, 
patterns that differ in relative positions only can be obtained 
from one of them by shift operations. This class of problems 
have wide applications in finite impulse response (FIR) fil- 
ters [19]-[25]. Graph-based algorithms [24], [25] synthesize 
directed acyclic graphs, in which partial sums define nodes and 
shifts are annotated on edges. In [24], optimal solutions can 
be obtained by exhaustive search of all topologies with high 
computational complexity. Entropy and conditional entropy 
are used in [25] for vertex decomposition. Pattern-based algo- 
rithms [19]-[23], [26], [27] reduce the MCM complexity by 
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first identifying recurring patterns, which are combinations of 
non-zero positions, and then calculating them only once. They 
usually use canonical signed digit (CSD) representation to 
identify potentially sharable subexpressions. The algorithms in 
[19], [21]-[23], [26], [27] use weight-2 subexpressions as the 
primitive elements. In contrast, [20] searches for the highest- 
weight common subexpressions. 

To minimize the number of additions in the matrix-vector 
multiplications in CFFTs over characteristic-2 fields consti- 
tutes a different type of the collection-of-sums problem, where 
X is over characteristic-2 fields. This implies two properties: 
(1) the summands are independent, and (2) the I's in M are 
equivalent to —1 and additions are equivalent to subtractions. 
The second property is noted but not utilized in [19]. Due to 
the two properties, CSE techniques for the MCM problem are 
not suitable for the problem considered in this paper. Heuristic 
CSE techniques proposed in the context of MCM problem 
result in modest complexity reductions since they do not take 
advantage of the second property (see, for example, [19]). 
For the first property, sophisticated CSE techniques, especially 
those relying on CSD representation, that are tailored for FIR 
filters (see, for example, [20]-[25]) cannot be directly ap- 
plied to our problem; to adapt these algorithms to our problem 
is not straightforward and requires nontrivial research efforts. 
Hence, we do not consider CSE techniques for the MCM 
problem in this paper 

Two CSE algorithms that account for the cancelation prop- 
erty were proposed in [6], [28], and we will compare our CSE 
algorithms against these. 

III. A Novel CSE Algorithm over 
Characteristic-2 Fields 

We propose a novel CSE algorithm with polynomial com- 
plexity that significantly reduces the additive complexities of 
CFFTs. Although our CSE algorithm does not guarantee to 
minimize the additive complexities, it may do so in some 
cases, especially when the size of the problem is small. 

Let us establish the terminology to describe our CSE algo- 
rithms. For a matrix-vector multiplication Y — MX, where 
Y = {Yo,Yi,...,Y^_if and X = (Xq, Xi, . . . , X„-_i)^ 
are n- and n'-dimensional column vectors and M is an ti x n' 
matrix, we refer to the components in Y as sums and the 
components in X as summands. Note that the sums in Y have 
one-to-one correspondence with the rows in M, and in direct 
computation the number of additions required to compute a 
sum is the number of ones in its corresponding row minus 
one. Hence, with a slight abuse of terminology, we sometimes 
use rows and sums in an exchangeable manner Similarly, there 
is a one-to-one correspondence between the summands and the 
columns in M, and we sometimes use columns and summands 
in an exchangeable fashion below. 

Our CSE algorithm achieves two kinds of savings: differ- 
ential savings and recurrence savings, as defined in Sections 
IIII-AI and FlII-BI respectively. 

A. Differential Savings 

Let Y — MX represent a matrix-vector multiplication, in 
which X is over characteristic-2 fields and M is a binary 



matrix. For the column positions where M,, and M,,^, rows 
Tp and Tc (rp / r^) of M respectively, both have ones, the 
difference (or sum) Mr^ — Mr of the two rows has zeros. If 
Mr^ — Mr contains fewer entries than one of the two rows, 
say Mr^, we can reduce the total number of additions by 
first computing Yr and then computing Yr^ =Yr+ {Mr^ — 
Mr )X. Let us denote the numbers of non-zero entries in 
Mr J,, Mr^, and Mr^ ~ Mr^ as Wp, Wc, and Wd, respectively, 
the differential saving (the number of additions saved) is given 
hy Wc ~ Wd ~ I- Since we are only concerned about positive 
savings, we use {wc — Wd — 1)^ = max{0, Wc — Wd — 1} in 
our algorithms. 

The price for the differential saving is that now Yr must 
be computed before Yr^, putting a dependency between the 
two sums. We use an ordered pair {rp,rc) to represent this 
dependency; we call Yr , the sum computed first, the par- 
ent, and refer to Yr^ as the child. Since each ordered pair 
introduces a dependency, to keep track of all dependency, we 
use a digraph to keep track of all ordered pairs, where the 
vertices are the row numbers in the ordered pairs and the 
edges are from the parent to the child in each pair We call 
this graph dependency graph henceforth in this paper There 
is no conflicting dependency as long as the dependency graph 
is acyclic. Thus, before any ordered pair can be added to 
the dependency graph, it is necessary to check whether the 
addition of the new ordered pair will introduce cycles in the 
dependency graph; if yes, this ordered pair is called cycle- 
inducing and hence not permissible. Cycle detection can be 
done recursively. 

When an ordered pair {rp,rc) is added to the dependency 
graph, both M and X need to be transformed. We first append 
Yr to X as a new summand. We also replace Mr^ with the 
difference Mr^ — Mr ; then due to the new summand Yr , a 
new column with a single one at the Tcth position and zeros 
at other positions is appended to M. We call these operations 
a differential transformation. 

Our differential transformations bear some similarities to 
the erasure correction approach used in [17]. As pointed out 
in [17], Y = MX is equivalent to [M \ I]{X'^ ,Y^Y = 0, 
which defines a code C with all codewords (X^,l^^)^; to 
compute Y — MX is equivalent to erasure correction with Y 
erased based on C. After a series of differential transformations 
as described above, the matrix-vector multiplication becomes 

Y = M'X', where X' = {X'^,Y'^f, Y' consists of 
the summands corresponding to all the parents in the ordered 
pairs, and M' has the same number of rows as M. By adding 
all-zero columns to M\ we can find a matrix M" such that 

Y - M"iX^,Y^f. Hence (M" - [0 | = 
0. Thus our differential transformations lead to a different 
parity check matrix for the same code C. Furthermore, the 
acyclic property for the dependency graph ensures that Y can 
be recovered by using the parity check matrix M" — [0 | /]. 
From this perspective our differential transformation is similar 
to that of the message passing part of [17]: both find an 
alternative parity check matrix with smaller Hamming weights 
for the code C, which can be used to compute Y. However, 
different search methods are used to obtain alternative parity 
check matrices in our work and in [17]. 
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B. Recurrence Savings 

We refer to the number of occurrences of a subexpres- 
sion (or pattern in the rows of M) as pattern frequency, and 
define the recurrence saving of each pattern as its pattern 
frequency minus 1. After a subexpression is identified, we 
append the subexpression to X as a new summand, and M 
is updated accordingly. These operations are referred to as 
a recurrence transformation. A sequence of recurrence trans- 
formations can be described in a matrix decomposition form: 
M = Mn Ilfjo^ T,, where = [/ | GJ]'^, the row vector 
Gj corresponds to a subexpression, Mu has no pattern recur- 
rence, and K is the number of identified subexpressions. Thus 
Y is computed in a sequential fashion: first assign 
then compute = T,x('' for z = 0, 1, . . . , X - 1, and 

finally compute Y = MbX''^\ For < < iiT- 1, let M^'^ 
denote MnHf^^ Ti, and Y = m'-'^x'-'K 

Compared with the matrix splitting method [20], recurrence 
transformations keep track of the identified subexpressions as 
new summands, instead of simply removing them. To reduce 
the computational complexity of pattern search, our CSE algo- 
rithm looks for only two-summand subexpressions. However, 
since each two-summand subexpression is in turn appended 
as a summand and multi-suimnand subexpressions can be ex- 
pressed recursively as two-summand subexpressions, our CSE 
algorithm efficiently exploits the recurrence savings of both 
two-summand patterns and multi-summand patterns. 

One limitation of the recurrence transformations above is 
that it considers only exphcit subexpressions, missing implicit 
subexpressions that are hidden by cancelation. We will now 
identify implicit subexpressions through forced patterns. To 
this end, after a two-summand pattern Xq + Xi is identified 
and introduced as a new summand X„, we try to impose the 
pattern on the rows containing only Xq or Xi by replacing 
Xq with Xi+Xn or Xi with Xo + Xn- After forcing patterns 
Xi+Xn or Xo+Xn on row Vi, if previously identified patterns 
emerge due to cancelation and therefore lead to complexity 
savings, we transform to reflect the forced pattern. If the 
forced pattern does not lead to any saving, we do not transform 
Mn ■ Since a forced pattern leads to complexity saving only 
when they match previously identified patterns, we search the 
rows only for previously identified patterns. Since we keep 
track of all two-summand patterns, we first search the rows 
for previously identified patterns that include Xq or Xi, which 
is inserted due to the forced pattern. If we find a previously 
identified pattern, say Xj = Xi + Xq, in row r^, we replace 
Xi + Xq by Xj and continue to search for all previously 
identified patterns that include Xj, and so on. 

Now we illustrate the advantage of the forced pattern method 
by a simple example. Say we have established three patterns 
as Xi = X1+X2, X5 = X3+Xi, and Xq = Xq+Xi. Now 
let us consider the sum Yq = Xq + X2 + X^, which does not 
contain the identified patterns X4, X5, or Xq exphcitly. But if 
we force Xq on 1o> we have Yq = Xi+X^ + X^ + Xq, which 
becomes Yq = X5 + Xq after replacing previously identified 
patterns Xi + X2 with X4 and X3 + X4 with X5 as described 
above. In this simple example, by forcing the pattern we reduce 
the number of additions by one. In a nutshell, it is a greedy 



strategy in which, based on existing subexpressions, we try to 
find an alternative expression that requires fewer additions for 
a sum. 

When introducing forced patterns for a sum, new summands 
for the sum are introduced. If any new summand is a sum, this 
introduces dependency between the two sums, and possibly 
cycles in the dependency graph. We replace Xi with Yi in 
the simple example above to illustrate such a case. If we force 
the pattern Xq = Xq + Yi pattern on Yq, we have Yq = 
Y1+X2+X3+XQ = X^+Xq. Although it reduces the number 
of additions by one, it requires that Yi should be computed 
before Yq. Since forced patterns introduce new dependency, 
we will keep track of this using the dependency graph and 
cycle detection is necessary in recurrence transformations if 
we consider forced patterns. 

C. Approximate Dynamic Programming 

We have discussed two kinds of transformations that result 
in differential savings and recurrence savings. A remaining 
question is: how should we coordinate the transformations 
associated with differential savings and recurrence savings? 
That is, which kind of saving is more preferable? A seem- 
ingly straightforward answer would be to use a simple greedy 
strategy: choose one transformation with the greatest saving. 
Instead of this simple greedy strategy, we adopt a different 
strategy. We justify our choice by approximate dynamic pro- 
gramming [29] below. 

Note that both differential and recurrence transformations 
can be expressed in a matrix decomposition form. Thus the 
coUection-of-sums problem can be viewed as a dynamic pro- 
gramming problem [29], where the cost to be minimized is 
the number of additions and each differential or recurrence 
transformation corresponds to one stage. The total cost is de- 
noted hy A = E^o^ff^*^ + Jr where ^f') G {0,1} is the 
cost of Stage i and Jji is the cost of implementing Mji. 
Let us denote M and X after the ith stage as the Ai'*^*^ 
and X'-^\ and they are the state variables. The idea of ap- 
proximate dynamic programming is to approximate and op- 
timize the cost-to-go J [29] . Suppose after the transforma- 
tions in Stage i, the matrix-vector multiplication is given by 
Since under direct computation, it needs 
VF(M**^) - n additions, where W^(M'^'') is the number of 
I's in M'-'\ we use = aW(W(M(*)) - n) as a lin- 
ear approximation of the cost-to-go, where a'^^^ approximates 
{A - Ej=o5^'') / {W{M^''^) - n). When a« = - 

E;=U^^'^) / (W^(MW) - n), JW = A- E;=o5^^'^ is 
indeed the cost-to-go. Suppose for Stage i, the largest differ- 
ential and recurrence savings are and Sr , respectively. 
Based on the above approximation, we can find a transforma- 
tion that minimizes the cost-to-go. If a differential transfor- 
mation is chosen, the matrix weight after the transformation 
is given by W{M'^'+^^) = W{M^'^) - sf; otherwise, it is 
W(M(*+i)) = VK(M(^)) - s^'^ - 1. Then the approximate 
optimal cost-to-go is the smaller between a*^'^ ■ (VK(Af '•*•') — 
s'-P - n) and 1 + a'^^ • (W{M^'^) - s'r^ - 1 - n). Thus 

(i) (i) 

a differential transformation is preferred when s]^' > sf- + 
1 - 1/oW, and a recurrence transformation is preferred if 
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sl^^ < sl'^ + 1-1 /a(*). Although it is difficult to compute a'-'^ 
since A is actually unknown, fortunately the choice between 
differential saving and recurrence saving does not require the 
precise value of A. It is obvious that < a^'^ < 1 for any 
i, and hence differential transformations are usually preferred 
over recurrence transformations even when s^'^ = si^^ . This is 
particularly the case when a(*) is small. For example, the ratio 
of the required number of additions after applying our CSE 
algorithm, and W{M^°^)-nis between 0.16 and 0.26. Thus, 
a*^°^ is clearly a small fraction. As i increases, a^*^ increases 
while s^'^ and Sr*^ decrease. Our CSE algorithm treats the 
differential transformations with preference in all cases. 

We comment that the simple greedy strategy which selects 
the greater one between s^p and si mentioned above cor- 
responds to always setting a^*^ = 1 in approximate dynamic 
programming, which does not provide a good approximation. 
Our simulation results confirm this observation, as the differ- 
ential saving first strategy usually leads to better results than 
the simple greedy strategy. 

Since we are using approximate dynamic programming in 
every stage, choosing a differential saving does not take into 
account all recurrence savings in future stages. Thus for some 
stages it may be an unwise choice. We propose a method to 
identify such differential savings and reverse them. Say Yo = 
X()+Xi+X2+ Xs and Yi = + X4 + X5 + Yq as a result 
of the differential saving from an ordered pair (0, 1). Since 
Yq + Xq = Xi + X2 + X3, we can replace Xq + Yq in Yi by 
Xi + X2 + X3 and it is clear that Yq and Yi have a common 
subexpression Xi + X2 + X3. Using the subexpression Xi + 
X2 + X3 effectively reverses the differential transformation 
represented by (0,1). To identify a reversal of this kind, we 
search for reversal patterns; a reversal pattern Yi+Xj consists 
of a sum Yi and one of its summands Xj. In contrast to other 
patterns, this pattern may have a recurrence saving of zero, 
that is, it appears only once. It can be shown that such a 
reversal saves only one addition, regardless of the frequency 
of the reversal pattern; thus, such a reversal is meaningful 
only when there are no other subexpressions involving Yi. For 
instance, in the above example, if there are more than two 
recurrences of Yq + X4, the subexpression Yq + X4 results 
in a greater saving than Yq + Xq. Thus it will be efficient to 
search for reversal patterns only after all recurrence savings 
are accounted for. 

Our CSE algorithm, shown below in Algorithm [T] has two 
major steps. Steps [T]l and[T]3, and they are referred to as the 
differential and recurrence steps respectively. 

Algorithm 1. Common Subexpression Elimination 

[T] 1 Identify the non-cycle-inducing pairs of rows with the Id 
greatest differential savings, select one pair out of them 
randomly, and transform both M and X as described 
above. 

[T]2 Repeat Step[T]l until there is no differential saving. 

[T]3 Identify the two-summand patterns with the Ir greatest 
recurrence savings, select one out of them randomly. 
Replace all occurrences of the selected pattern with a 
new entry. On those rows with only one entry of the 
pattern, force the pattern if it leads to less I's in the 



row. 

[T|4 Go to Step [U 1 until there is no recurrence saving. 
[T|5 If there is a reversal pattern, reverse the differential sav- 
ing and go to Step [T] 3. 

Since differential savings are due to the overlapping ones 
in two rows, there is no positive differential saving if there is 
no recurrence saving. This is the reason for the termination 
condition in Step [T]4. In Steps [T]l and [T]3, we randomly 
select one transformation among those with the Id greatest 
differential savings and the Ir greatest recurrence savings re- 
spectively. There is a tradeoff between search space (and hence 
performance) and search complexity: greater Id and Ir enlarge 
the search space that may lead to greater savings at the expense 
of higher complexity. In our work, = = 2 appears enough 
for most cases. For matrices with small sizes, the additional 
complexity caused by expanding the searching space is usually 
affordable. For large matrices, we use = = 1- Since 
Algorithm [U is a randomized algorithm, the result of each run 
may vary. However, simulation results show that the variance 
between different runs is relatively small in comparison to the 
number of required additions. 

Our sequential transformation of M in Section lTl-Bl is simi- 
lar to the CSE algorithm in [26], but the algorithm in [26] does 
not take advantage of the cancelation property of characteristic- 
2 fields as Algorithm [T] With forced patterns. Algorithm [T] 
takes advantage of the cancelation property not only by dif- 
ferential savings but also by recurrence savings. Although Al- 
gorithm [T] and those in [17], [28] all take advantage of the 
cancelation property of characteristic-2 fields, they use quite 
different strategies. Algorithm [T] uses a top-down approach 
to build the addition sequence by reducing the binary matrix, 
while a bottom-up approach starting from summands was used 
in [28]. The CSE algorithm in [17] first rebuilds the binary ma- 
trix from low-weight linear combinations of rows, then reduces 
the matrix top-down using recurrence savings. Also, although 
the method in [17] takes advantage of the cancelation property 
by erasure decoding in the message passing part, it fails to do 
so in its CSE part. As we will show in Section |V] Algorithm [T| 
leads to significantly better results than the method in [17]. 

D. Fast CSE 

When the size of M is large, the time complexity of Algo- 
rithm [T] may be prohibitive. We propose several improvements 
to reduce the time complexity of Algorithm [T] 

In Algorithm [T] we restart the differential step after each 
recurrence step. But the possibility that new differential sav- 
ings emerge after we identify a pattern for recurrence saving 
is quite small. In order to reduce the complexity, we do not 
revisit the differential step after the recurrence step has ended, 
essentially decoupling the two steps. This not only reduces the 
time complexity by reducing the number of times Step [T| 1 is 
repeated, but also enables us to further accelerate both steps 
by space-time tradeoff, which will be discussed below. Note 
that our simulation results show that the decoupling of the two 
steps results in only negligible performance loss. 

Now that the differential step is standalone, it is necessary 
to avoid repeated exhaustive searches. There are only n rows 
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in M, so all possible differential saving can be put in an n x n 
array D, where Dij stands for the differential saving of the 
ordered pair of rows {ri,rj). An exhaustive search is needed 
to initialize D. Afterwards, at most 2(n — 1) entries (namely, 
the non-diagonal entries in row and column rj) of the array 
need to be updated after each ordered pair (r^ , rj ) is added to 
the dependency graph. Whenever one pair of rows is detected 
to be cycle-inducing, its differential saving will be set to -1 and 
hence it is excluded from future consideration for differential 
transformations. As the number of possible pairs decreases 
continuously, the search will be increasingly simpler 

A similar idea can be used to reduce the time complexity of 
the recurrence step. Since elimination of one pattern will only 
change a small portion of the pattern frequencies, to expedite 
searches, we store the recurrence savings and update them 
after each recurrence transformation. 

Because not all patterns exist and the number of possible 
patterns will decrease continuously, it will require less storage 
space if we keep track of only the patterns with positive recur- 
rence savings. However, this will involve an exhaustive search 
to update the pattern frequencies each time after a pattern is 
identified, which may results in high time complexity when 
the size of M is large. Instead, we keep track of all pattern 
frequencies, including those with no recurrence savings, in a 
two-dimensional array R. Suppose after the differential steps 
are over, and M' has fi columns. Initially, R is an upper trian- 
gle array with n — 1 rows, where Rij is the recurrence saving 
of the two-summand pattern Xi + Xi+j^i for Q < i < n ~ 2 
and 0<j<n — i — 2. The recurrence saving array R is 
arranged in this fashion so that frequency updates can be done 
by direct addressing without search and it is not necessary to 
remove frequencies. When a new pattern is identified, the two- 
summand pattern becomes the (n+ l)th summand. Thus, the 
frequency of the ith and the {n + l)th summands is appended 
to the ith row. Furthermore, a new row with only one element, 
the frequency of the nth and {n+ l)th summands, will be the 
last row of R. After Xi + Xj is identified as a subexpression, 
all frequencies related to Xi or Xj need to be updated. That is, 
Ri'^i-ii-i for all i' < i, Rfj^j'-i for all j' < j, 
for all i" > i, and B.jj"-j-i for all j" > j are updated 
accordingly. Furthermore, Ri j^i^i is set to zero. 

During Step[T]l, our CSE algorithm keeps only one copy of 
each row. Actually one row can have different decompositions, 
based on differential savings with different rows. To exploit 
the best differential saving for each row, a modified differential 
saving update scheme is developed. 

Let us assume that the ordered pair (rp,rc) is selected for 
differential transformation, which replaces Af^e with Af^ . 
For row (r^ ^ rp), there are two possible differential sav- 
ings: one between Mr^ and and the other between M'^ 
and M'^.. If the latter is greater, we simply update Dd- If the 
former is greater, the differential saving Dd is not changed 
and Mr^ is saved so that it can be used when Dd is selected. 
If two savings are equal, it is randomly chosen which copy to 
use. Since we may need different copies of Mr^ for each r^, 
a three-dimension array K whose entry Kij keeps a copy of 
Mvj corresponding to Dij if Af^j provides a greater differ- 
ential saving than Af^ with regard to M^- Since this can 



occur recursively, for each row at most n — 1 different rows 
may be stored in K. 

Our CSE algorithm incorporating the above improvements 
is shown in Algorithm ID 

Algorithm 2. Fast CSE 

mi Initialize the differential saving array D and K. 

1212 Find the non-cycle-inducing pairs of rows with the Id 
greatest differential savings in D, randomly choose one, 
eliminate it, and update D and K accordingly. 

|2l3 Repeat Step |2l2 until there is no positive entry in D. 

|2l4 Initialize the recurrence saving array R. 

[Us Find the patterns with the Ir greatest recurrence savings 
in R, randomly choose one, replace all occurrences of it. 
On those rows with only one entry of the pattern, force 
the pattern if it leads to less I's in the row. Update R. 

|2l6 Repeat Step |2l5 until all entries in R are zero. 

ID? If there is a reversal pattern, reverse the differential sav- 
ing, update -R, and go to Step|2l5. 

Our simulation results show that after a single run, the 
difference between the total additive complexities obtained by 
Algorithms [T] and |2] is negligible. However, the time complex- 
ity of Algorithm |2] is much smaller than that of Algorithm [T] 
For example, when Af is a 255 x 255 matrix, for a single run 
Algorithm [U needs about ten hours while Algorithm |2] finishes 
in approximately five minutes. The difference in run time is 
greater for matrices with larger sizes. Since Algorithms [T] and 
|2] are both probabilistic, the speed advantage of Algorithm |2] 
over Algorithm [T] enables us to run Algorithm [2] many more 
times, enhancing the possibility of obtaining a better result 
than using Algorithm [T] within the same amount of time. 

E. Example 

Now we provide an example of Algorithm |2] At the begin- 
ning, K is empty and 



Af = 



10 111 
11111 
110 11 
1110 



D = 







-1 



Choosing (0, 1) and adding a column corresponding the new 
summand Yq, we have 



Af(i) = 



10 1110 
1 1 
110 110 
1110 



D = 



-1 
-1 

1 





-1 
-1 






-1 







-1 



Since (1,0) is cycle-inducing, its saving is simply set to -1. 
We also set A'12 to (1, 1, 1, 1, 1) to keep track of Mi. 
Choosing (1,2), the matrices are updated as 

'10 1110 0' 

1 1 

1 1 

1110 



Af(2) = 



D 



-1 
-1 













-1 
-1 
-1 



7 



Note that (2, 0) is cycle-inducing so there is no positive 
differential saving left. 

Now we enter the recurrence transformations. The recur- 
rence saving array R for Af'^' is initialized to all zeros except 
that i?2,o = 1^ which corresponds to the pattern X2 + X-s- 
Hence Go is (0, 0, 1, 1, 0, 0, 0) and the algorithm stops at 



1 1 1 

1 1 

1 1 

1 1 



and the recurrence saving array becomes all zeros. The remain- 
ing matrix Mij needs five additions. The identified pattern 
X2 + X3 also needs one addition. So 1^ = MX can be 
calculated by six additions, whereas a straightforward imple- 
mentation of 1^ = MX requires 12 additions. Note that 
techniques such as forced patterns or reversal patterns are not 
applicable in this simple example. Nevertheless, since it can be 
easily verified that Y = MX cannot be done in five additions, 
our CSE algorithm minimizes the number of additions in this 
case. Note that if we only use recurrence savings, the result 
will be seven additions. 



F. Time and Storage Complexities 

Since the reduction of additive complexities depends on 
M only, the output of Algorithm |2] for a given CFFT can 
be used for any input vector. Hence Algorithm |2] is simply 
precomputation and its complexity should not be considered as 
part of the complexities of CFFTs. To show that Algorithm|2]is 
computationally tractable, we provide an order-of-magnitude 
analysis for the time and area complexity of Algorithm |2] 
below. 

Algorithm ID requires only four types of operations: adding 
two rows, inserting or removing entries from a row, searching 
for a two-summand pattern in a row, and comparison to find 
the greatest saving. During the optimization, while the num- 
ber of columns in the matrix M^^'^ increases continuously, 
the number of I's in each row decreases. To facilitate row 
additions, for each row we only store the positions of I's as a 
sorted list. Since the original M has n' columns, adding two 
rows is equivalent to merging two sorted lists of size at most 
n' , which requires at most 2n' comparisons. For simplicity, we 
assume inserting or removing entries in a row has the same 
complexity as adding two rows. Searching for a two-summand 
pattern in a row needs at most n' comparisons. We assume the 
complexity of either appending an entry to a row or updating 
a matrix entry is negligible. 

Now since differential transformations described in Steps |2]l, 
|2]2, and|2]3 and recurrence transformations in Steps |2]4, |2]5, 
and |2]6 are independent, we can analyze them separately. In 
Step 12] 1, the initialization of the differential saving array D 
needs to add rows for n{n — l)/2 times, so it takes at most 
n{n ~ l)n' comparisons. The result of Step|2]2 is an acyclic 
digraph with at most n nodes, so at most n{n — l)/2 pairs of 
rows are identified. To identify one pair of rows, we need at 
most 5 — 1 comparisons, where g is the number of remain- 
ing pairs of rows. After one differential saving is identified. 



the child row needs to be updated, which requires at most 
2ri' comparisons. Correspondingly, computing the differential 
savings relative to the new child row needs up to 2n'{n — 2) 
comparisons since the parent row is ineligible. So it will take 
atmost^^_„(„_-^)/2(2n'(n-2)+2n'+.g-l) « ©(n^+n^n') 
comparisons. Updating K does not requires extra computa- 
tion. Therefore the number of total comparisons for differential 
transformations is 0(n^ + n^n'). 

To initialize R, we scan the matrix row by row to find the 
recurrences of each two-summand pattern. For any row, we 
increase Rij by one if the two-summand pattern Xi+ Xi+j-i 
is present. Since there are at most n' I's in a row, it has at 
most n'{n' — l)/2 two-summand patterns and hence requires 
at most n'(n' — l)/2 additions. Thus Step|2]4 needs at most 
nn'{n' — l)/2 additions. For the first recurrence transforma- 
tion, it will need at most {n+n'){n+n' — l)nn' /2 comparisons 
to find the greatest in R, because there are {n + n'){n + 
n' — l)/2 possible two-summand patterns when all n sums in 
Y have become summands after differential transformations. 
After that, to identify each two-summand pattern, it needs 
(s + n + n')(s + n + n' — l)/2 — 1 comparisons, where s is 
the number of identified patterns. After a pattern is identified, 
all rows with the pattern need to be updated. For each pattern, 
it needs to go through at most n rows. Hence it requires at 
most 2nn' comparisons. If the pattern is forced, it needs to 
go through all identified patterns, which requires at most n' s 
comparisons for one row and nn' s comparison for n rows. It 
requires at most nn'{n' — l)/2 additions to update R. Under 
direct computation, M requires at most nn' additions. By 
identifying one pattern, the number of additions increases by 
one while saving at least one addition than direct computation. 
Based on this observation, we deduce that there are at most 
nn' 12 identified patterns. Thus the number of comparisons re- 
quired in Step|2]5 is at most 'Y^™!^'^"^ {{s-^n+n'){s-\'n+n' — 
l)/2 — l + 2nn' + nn's) « 0{n^n'^). The number of required 

additions is at most X^^^o^^"^ ("-"-' C*^' ~ ~ 0{n^n''^). 

Assuming additions have the same complexity as compar- 
isons, it is negligible. To identify one reversal pattern needs 
nn'{n' — 1), and its complexity is also negligible compared 
to those of other parts. Hence the complexity of our CSE 
algorithm is 0{n'^n'^ + n^), or 0{n^) assuming n = n'. 

The time complexity above is for one run of Algorithm |2] 
Since Algorithm |2] is probabilistic, it is necessary to run it 
multiple times to obtain good results. However, a very large 
number of runs is not necessary even for large problems, 
since the variance between different runs is relatively small 
in comparison with the total number of required additions. 

The storage complexity of Algorithm |2] includes five parts: 
M, D, R, K, and the list of identified two-summand patterns. 
For M, it is at most nn' . For D, it is n^ and can be reduced 
to n{n — 1) since Da is not necessary. Since there are at most 
nn! 12 identified patterns, the storage of R is at most (nri'/2 + 
n + n'){nn' /2 + n + n' — l)/2 and it takes at most nn' to 
keep the list of identified patterns. The three-dimensional array 
K requires at most n times of M. Hence the total storage 
complexity is at most 0{n^n'^), or 0{n'^) assuming n — n' . 

Note that the upper bound nn' /2 of the number of identified 
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patterns for an n x n' matrix is usually not tight. For example, 
for a 1023 x 1023 matrix, only less than 30,000 patterns are 
identified in our simulation. 

IV. Relations among Various CFFTs 

Our CSE algorithm can be used to reduce the additive com- 
plexities of various CFFTs. In this section, we will investigate 
their properties and establish the relations among them. This 
study also simplifies the analysis of their multiplicative and 
additive complexities as well as performance comparison in 
Section [V] 

Let us first study the properties of a block diagonal matrix 
L = dia.g{Lo, Li, . . . , Li^i), where Li's are all circulant 
matrices. Clearly, i/s are all symmetric and hence L is also 
symmetric. We formally present a result mentioned in [7] and 
[30, pp. 273], which can be proved easily by inspection. 

Lemma 1. Given L = diag(io: Li, . . . , that is a block 

diagonal matrix where Li 's are all circulant, its inverse L^^ — 
diag(i(^^, ij'^, . . . , is also a block diagonal matrix 

where L~^'s are all circulant. Furthermore, suppose Li is 
generated by ji and bi = (7^, 7^^, . . . , 7^ ' ) is a normal 
basis, then L~^ is a circulant matrix generated by (3i, where 
/??,... , 0f ' ) is the dual basis of bi. 

Thus, for DCFFTs and SCFFTs Lifi is a cyclic convolution 
and can be calculated by the bilinear form Qi{Ribi ■ Pif i) = 
Q^-Xc,-PJi) [12H15]^whereb, = (7„72,...,7r'-').For 
ICFFTs, by Lemma[T]ij ^/^ is also a cyclic convolution given 
by the biUnear form Q^{R,{13,, (3l . . . , pf''") ■ PJ,) = 
Qi{c* ■ Pifi). There are different bilinear forms of cyclic 
convolution and all of them can be used in CFFTs. Henceforth, 
we assume that the same bilinear forms (Pi's and Q/s) are 
used in all CFFTs. In this paper, we focus on the CFFTs with 
the following forms: 

DCFFT F = ALf 

= AQ{c ■ Pf) (2) 
SCFFT F' = L^A''^f' 

= P'^{c-{A'Qff') (3) 
ICFFT F" = L^A^f 

= P^(c* • Q^A- V) (4) 

where Q and P are binary matrices and usually sparse, and 
A is a dense binary square matrix. Note that the equality ^ 
is due to -L = QCP where C = diag(co, ci, . . . , c„_i); 
the equality (|4|l follows Q and is a direct application of 
Lemma [T] Due to the symmetric properties of L and L^^, 
the above CFFTs have alternative forms: DCFFTs are also 
given by F = AP^ [c ■ f ); SCFFTs are also given by 
F' ^ Q{c- {A'Q^Yf); ICFFTs are also given by F" = 
Q{c* • PA~^f). However, these alternative forms can be 
considered as the forms in (|2]i, Q, and ^ with different P 
and Q matrices. Since we assume all the bilinear forms are 
the same, we will not consider the alternative forms further. 

We observe that all CFFTs in (|2]l, ([3]i, and (|4]l are deter- 
mined by two factors. First, they all depend on the order 
of cyclotomic cosets, i.e., the coset leaders fc/s, which in 



turn determine the coset size m^'s. As in [6], we assume the 
same normal basis is used for all cyclotomic cosets of the 
same size. Hence, all CFFTs also depend on the normal basis 
selected for each subfield GF(2'"*). For simplicity, we denote 
the collections of DCFFTs, SCFFTs, and ICFFTs for different 
ki's, mi's and the normal bases as T), S, and I, respectively. 
Next, we investigate the impact on computational complexities 
of CFFTs by the two factors above. We will consider first 
multiplicative complexities and then additive complexities. 

Lemma 2. Assuming that the same bilinear forms are used, 
DCFFTs, SCFFTs, and ICFFTs as defined in ©, Q, and & 
have the same multiplicative complexities. 

Proof The multiplicative complexity is determined by 
the number of non-one entries in c in DCFFTs and SCFFTs 
or c* in ICFFTs (all elements in c or c* are non-zero). Since 
using normal bases, the number of I's in c and c* are both 
the number of all-one rows in all Ri's. Thus the multiplicative 
complexity is independent of the choices of normal bases and 
independent of the constant vectors c or c*. ■ 
The additive complexities of all CFFTs are due to the matrix- 
vector multiplications needed in CFFTs. Clearly, the number 
of additions required to compute any matrix-vector multipli- 
cation Y ~ MX varies with the implementation. In the 
following, we will consider additive complexities under direct 
computation. As pointed out in Section lTlI-CI to compute Y = 
MX by direct computation, it needs W{M) — n additions. 
In some cases the additive complexities of two matrix-vector 
multiplications can be related regardless of implementation. 
We say two matrix-vector multiplications are additively equiv- 
alent if one matrix-vector multiplication can achieve any ad- 
ditive complexity the other can, and vice versa. An important 
case of additive equivalence is given in the following lemma 
without proof. 

Lemma 3. If two binary matrices M and M' satisfy M' = 
nA^n', where H and Ti' are two permutation matrices, then 
the matrix-vector multiplications defined by M and M' are 
additively equivalent. 

With a slight abuse of terminology, we say two CFFTs are 
additively equivalent when their corresponding matrices are 
additively equivalent. By a straightforward proof, we have the 
following property: 

Lemma 4. For any two CFFTs in T) that differ only in ki 's and 

mi's, their A's and L's are additively equivalent, respectively. 
Thus, the two CFFTs in D are additively equivalent. The same 
property holds for S and X. 

We now consider additive complexities for all CFFTs when 
normal bases vary, too. 

Lemma 5. All CFFTs in T> have the same additive complexity 
under direct computation. So do those in S and X, respectively. 

Proof It suffices to prove the first part, and the arguments 
for S and X are similar. First, since different orders of cosets 
result in additively equivalent DCFFTs due to Lemma HI we 
assume the same order of cosets and consider only different 
normal bases without loss of generality. Realizing that differ- 
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ent normal bases would not change P and Q in (|2]i, we focus 
on how different normal bases impact AQ. Expressing A as 
[Aq I Ai I ■ • • I where A^ is a (2™ - 1) x mj binary 

matrix, F - Ai/' = [AqLo \ A^Li \ ■■■ \ Ai^iLi^i]f. 
For each Ai, the rows are (2™ — l)/(2™' — 1) copies of 
the set of m^-bit row vectors with all combinations except 
all zeros. Thus Ai's corresponding to different normal bases 
in GF(2'"') are equivalent up to permutation. Recall that Q 
is a block matrices for which the blocks off the diagonal 
are zero matrices and the diagonal blocks are Q/s. Thus, 
AQ = [AoQo I AiQi I • • • I Ai^iQi_^]. Thus AQ's coitc- 
sponding to different normal bases also have the same additive 
complexity under direct computation. Hence all DCFFTs in T) 
have the same additive complexity under direct computation. 

■ 

From Lemma [T] we establish a relation between T and S. 

Lemma 6. Given an ICFFT F" = L^^A^^f, there exists 
an SCFFT F' = L' Al^ j' such that L' — L^^ and A'^ and 
A^^ are equivalent up to permutation, and vice versa. 

Proof: It suffices to show the first part, and the argu- 
ment for the second part is similar. For a DCFFT given by 
F = ALUf, the transform F* = U^L^A'^f is an- 
other DFT, where F* ^ {Fo, F,,-!, F.n-2, • • ■ , i^i) = n*F 
and n* is a permutation matrix. Given an ICFFT F" = 
L^^A^^f, clearly F" ^ UF* = UWF. Suppose the in- 
dices of the components of F' = UF are in the order as 
{ko,ko2, . . . , kQ2""'-^,. . . , fc,_i2™'-i-i) ^^^^^ ^j^g 

dices of the components of F" = UF* are in the order as {n— 
ko, n - ko2, . . . , n - fco2™n-\ ...,n- fc,_i2™'-i-i) 
Note that both modulo operations above are componentwise. 
Since n — ki2^ = {n — ki)2^ mod n, F" is also ordered in cy- 
clotomic cosets. Let us consider an SCFFT with the same order 
of cyclotomic cosets: F" = L" A"'^ f" where /" = HH*/. 
Note that the order of the cyclotomic cosets sizes rrii remains 
the same in L" and i^^. Thus by Lemma[T]there exist normal 
bases such that L" — L^^. Choosing such normal bases, we 
construct an SCFFT F" = L^^A"'^/" = L^^A^^f. Thus 
L \A"^Un* - A"^)/ = for arbitrary / and full rank 
L \ Hence = A"^U*U. ■ 

Note that Lemma |6] holds regardless of implementation. 
Since this mapping exists for any ICFFTs or SCFFTs, Lemma|6] 
implies that ICFFTs and SCFFTs are additively equivalent. 

Finally, we are ready to relate the additive complexities of 
all CFFTs under direct computation. 

Lemma 7. The DCFFTs, SCFFTs, and ICFFTs in (O, Q, 
and ^ all have the same additive complexity under direct 
computation. 

Proof: Due to Lemma |5] it is sufficient to show that the 
additive complexities of two CFFTs of different types are the 
same, which holds for an SCFFT and an ICFFT by Lemmas |5] 
and |6] Now let us show it is the same for a DCFFT and an 
SCFFT. 

In length-n DCFFTs, A is an n x n matrix, Q is an nx n' 
matrix {n' > n), and P is an n' x n matrix. Under direct 
computation, the number of required additions for a DCFFT 
defined in © is W{AQ) -n + W{P) - n'. Since /' = H/, 



we have F' = A'Q{c ■ Pf), where F' = UF and A' = 
UA. For an SCFFT F' = P^ {c ■ {A'QYf), the additive 
complexity under direct computation is VF((A'(5)^) — n' + 
W{P'^)-n. Since A'Q = UAQ, so A'Q and AQ have the 
same number of I's. Since matrix transpose does not change 
the number of I's, W{(A'QY) = W{AQ) and VF(P^) = 
W{P). Hence any DCFFTs in ^ and any SCFFTs in © have 
the same additive complexity under direct computation. An 
alternative direct computation for both DCFFTs and SCFFTs 
is to multiply A and Q separately. It is easy to verify that the 
conclusion is the same. ■ 

V. CFFTs WITH Reduced Additive Complexities 

Using Algorithmic we construct CFFTs with reduced addi- 
tive complexities for lengths 2™ — 1 up to 1023, and we present 
their complexities in Table|T] CFFTs of length beyond 1023 are 
not considered because for two reasons: first, lengths beyond 
1023 are rarely needed for the primary application considered 
in this paper, Reed-Solomon decoders; second, efficient cyclic 
convolutions for CFFTs of longer lengths (for example, 11- 
point cyclic convolution for length-2047 CFFTs) are not avail- 
able in [14]-[16]. For all our CFFTs, the cyclotomic cosets 
are ordered by their leaders; for cyclic convolutions of lengths 
up to nine, we use the bilinear forms provided in [16], and 
we construct a length- 10 cyclic convolution based on those of 
lengths two and five, by the Agarwal-Cooley algorithm [31]; 
the primitive polynomials and vector-space representations in 
[32, Sec. B.3] are used for all fields; for each field, we choose 
the normal basis whose leader is the smallest power of the 
primitive element. We observe that the multiplicative com- 
plexities are the same for all CFFTs due to Lemma |2] Due 
to Lemma |6] SCFFTs and ICFFTs are additively equivalent, 
and the additive complexities of both SCFFTs and ICFFTs are 
presented together in Table |T] We also observe that SCFFTs 
and ICFFTs require more additions than DCFFTs, and the 
reason for this was given in [8]. 



TABLE I 

Complexities of Full Cyclotomic FFTs 



n 


Mult. 


Additions 


DCFFT 


SCFFT/ICFFT 


Ours 


[6] 


Ours 


[7] 


[8] 


7 


6 


24 


25 


24 


24 




15 


16 


74 


77 


76 




91 


31 


54 


299 


315 


307 






63 


97 


759 


805 


804 






127 


216 


2576 


2780 


3117 






255 


586 


6736 


7919 


6984 






511 


1014 


23130 


26643 


27192 






1023 


2827 


75360 




77276 







In Table H] we also compare the additive complexities of our 
CFFTs to those in [6]-[8], the best results of CFFTs in the 
open Uterature to our knowledg ^ In Table m some entries are 
blank due to unavailability of comparable data: the additive 
complexity of DCFFT of length 1023 is not provided in [6]; 
only length-7 ICFFT was provided in [7] and only length- 
15 SCFFTs was provided in [8]. For length-7 FFT, both our 

^A length-15 DCFFT with 76 additions was reported in [16]. 
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TABLE II 
Complexities of Full FFTs 



n 


Horner's rale 


Goertzel's alg. 


[5] 


[33] 


Bergland's alg. 


Prime-factor [2J 


Our DCFFTs 


Mult. 


Add. 


Mult. 


Add. 


Mult. 


Add. 


Mult. 


Add. 


Total 


Mult. 


Add. 


Total 


Mult. 


Add. 


Total 


Mult. 


Add. 


Total 


7 


36 


42 


12 


42 


6 


26 


29 


29 


174 








9 


37 


82 


6 


24 


54 


15 


196 


210 


38 


210 


16 


100 


41 


97 


384 














16 


74 


186 


31 


900 


930 


120 


930 


60 


388 


289 


289 


2890 








108 


612 


1584 


54 


299 


785 


63 


3844 


3906 


282 


3906 


97 


952 


801 


801 


9612 














97 


759 


1826 


127 


15876 


16002 


756 


16002 


468 


3737 


2113 


2113 


29582 














216 


2576 


5384 


255 


64516 


64770 


1718 


64770 


646 


35503 


1665 


5377 


30352 


5610 


5610 


89760 


1135 


3887 


20902 


586 


6736 


15526 


511 


260100 


260610 


4044 


260610 






13313 


13313 


239634 


39858 


39858 


717444 


6516 


17506 


128278 


1014 


23130 


40368 


1023 


1044484 


1045506 


9032 


1045506 






32257 


32257 


645140 


42966 


42966 


859320 


5915 


30547 


142932 


2827 


75360 


129073 



DCFFT and SCFFT achieve the smallest additive complexity 
of the ICFFT in [7]; for lengths 15, 31, 63, and 127, our CFFTs 
have additive complexities 4%, 5%, 6%, and 7% smaller than 
those reported in [6]; for lengths 255 and 511, our CFFTs 
reduce additive complexities by 15% and 13%, respectively, 
than their counterparts in [6]. To compare our length-7 DCFFT 
with that in [6], see Appendix lAl 

We also compare our results to other FFT algorithms in 
Table HI] For Horner's rule [34], Goertzel's algorithm [14], 
Zakharova's method [5], the complexities are reproduced from 
[6] except that the complexities of length- 1023 FFTs are repro- 
duced from [2]; the complexities of Bergland's algorithm [35] 
and the prime-factor FFTs [2] are obtained from [2], [3]. For 
reference, we also consider the algorithm proposed by Wang 
and Zhu [33], which is known to be asymptotically fast, and 
its complexities are obtained from [33, eq. (11) and (12)]. 

Since all the algorithms require both multiplicative and ad- 
ditive complexities, it is clear that a metric for the total com- 
plexities is needed for comparison. We use a weighted sum 
of the additive and multiplicative complexities as the metric, 
assuming the complexity of each multiplication is 2m — 1 
times as that of an addition. Our assumption is based on 
both hardware and software considerations. In hardware imple- 
mentation, a multiplier over GF(2'") generated by trinomials 
requires — 1 XOR and AND gates (see, e.g., [36]), 
while an adder requires m XOR gates. Assuming that XOR 
and AND gates require the same area, the area complexity of 
a field multiplier is 2m times that of an adder over GF(2™). 
In software implementation, the complexity can be measured 
by the number of word-level operations (see, for example, 
[37]). Using the shift and add method as in [37], a multiplica- 
tion requires m — 1 shift and m XOR word-level operations, 
respectively while an addition needs only one XOR word- 
level operation. Whenever the complexity of a multiplication is 
more than 2m — 1 times as complex as that of an addition (for 
example, in the hardware implementation described above), 
our assumption above underestimates the relative complexity 
of multiplications and hence puts our results in a disadvantage 
in comparison to other FFT algorithms since CFFTs have 
reduced multiplicative complexities. We would also like to 
point out the similarity between our metric and the one used 
in [33], where the multipUcation over GF(2™) was treated 2m 
times as complex as an addition. 

The total complexities of Horner's rule, Goertzel's algo- 
rithm, and [5] are not presented in Table HIl since the advantage 
in complexities of our CFFTs over Horner's rule, Goertzel's 



algorithm, and [5] is clear: our CFFTs require fewer multi- 
plications and fewer additions; the savings achieved by our 
CFFTs are very significant, and in some cases the multiplica- 
tive complexities of our CFFTs are only small fractions of 
other algorithms. We remark that the multiplicative complex- 
ities of Zakharova's method are closer to those of CFFTs, 
which is not surprising given their similarities [6]. The total 
complexities of [33], Bergland's algorithm, the prime-factor 
FFTs [2] and our CFFTs are presented in Table [III since 
in comparison to these algorithms our CFFTs have smaller 
multiplicative complexities but higher additive complexities. 
In comparison to [33], our CFFTs achieve total complexity 
savings of 69%, 52%, 73%, 81%, 82%, 49%, 83% and 80% 
for lengths 7, 15, ... , 1023, respectively. For lengths 255, 511, 
and 1023, our CFFTs achieve total complexity savings of 83%, 
94%, and 85% over Bergland's algorithm, and 26%, 69%, and 
10% over the prime-factor FFTs [2], respectively. 

We remark that, as in many previous works (see, for ex- 
ample, [6]-[8]), only the multiplications and additions are 
considered in the complexity comparison. This is reasonable 
if the CFFTs are implemented by combinational logic, and the 
required numbers of multiplications and additions translate to 
the numbers of finite field multipliers and adders in combina- 
tional logic. Under the same assumption, memory overhead 
and intermediate memory access are not considered in the 
comparison above. This would not be the case if CFFTs were 
implemented in software, but this is beyond the scope of this 
paper 
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Appendix A 
Length-7 DCFFT 

• Pre-additions p = {po,pi, . . . ,ps)'^ = Pf require 8 
additions: po = /o, P2 = f-i + /4, Ps = /i + /2, P4 = 
fi + fi, Pi = P2 + fi, P6 = h + h, P7 ^ fs + fe, 
P8 = f3 + f5, and p5 = p6 + fs- 
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Fg)^ = AQg require 



• Pointwise multiplications g — (go, 5i, • • ■ , Ss)^ = c ■ p, 
where c — (1, 1, a, a^, a^, 1, a, o? ^ ol^)^ , need 6 multi- 
plications 

• Post-additions F = (Fo,Fi, 
16 additions: io = .93 + .94, <i = .9o + .9i, ^2 = .9i + 55, 

^0 = 50 + ^2, ^3 = .92 + .94, ^4 = .98 + ^3, ^5 = .97 + ^4, 
^5 = ^1 + ^5, ^6 = .96 + ^4, ^7 = ^1+ ^6, ^6 = ^0 + ^7, 
^3 = -Pe + ^5, ^8 = ^3 + ^2, ^2 = ^3 + ^8, ^1=^2 + ^6, 

and F4 = i2 + ^7- 
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